Compact trie clustering for overlap detection in news
Abstract
We investigate document clustering by adapting Zamir and Etzioni's approach to online web document clustering. Specifically, we generalize the Suffix Tree Clustering method to allow for a wider range of clustering techniques. Applied to a corpus of news articles, the modified technique improves precision by 29% while running 8% faster than the original algorithm.
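For orientation, the general idea behind Suffix Tree Clustering as it is commonly described (documents that share a phrase form a base cluster, and base clusters whose document sets overlap heavily are merged) can be sketched as below. This is not the generalized method of the paper: the sketch replaces the suffix tree with a plain dictionary of word n-grams, and the names `base_clusters`, `merge_clusters`, `max_len`, `min_docs`, and `threshold` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each word phrase (1..max_len words) to the documents containing it.

    Assumption: a dictionary of n-grams stands in for the suffix tree.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                phrase_docs[tuple(words[start:end])].add(doc_id)
    # Keep only phrases shared by at least min_docs documents.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}


def merge_clusters(clusters, threshold=0.5):
    """Merge base clusters whose document sets overlap in both directions."""
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        shared = len(clusters[a] & clusters[b])
        if (shared / len(clusters[a]) > threshold
                and shared / len(clusters[b]) > threshold):
            parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= clusters[p]
    return sorted(sorted(d) for d in merged.values())
```

Calling `merge_clusters(base_clusters(docs))` on a list of article strings returns lists of document indices, one list per merged cluster. The 0.5 overlap threshold is only a placeholder value for the sketch and would need tuning for a real corpus.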
Similar resources
A Genetic Approach to Tuning Compact Trie Clustering
The Compact Trie method for document clustering is sensitive to the kind of text it is applied to, but contains various parameters that may be tuned for adaptation to specific applications. We implement a genetic algorithm for optimizing these parameters and apply it to a corpus of texts to demonstrate the feasibility of using genetic algorithms for tuning.
Compact Suffix Trees Resemble PATRICIA Tries: Limiting Distribution of the Depth
Suffix trees are the most frequently used data structures in algorithms on words. In this paper, we consider the depth of a compact suffix tree, also known as the PAT tree, under some simple probabilistic assumptions. For a biased memoryless source, we prove that the limiting distribution for the depth in a PAT tree is the same as the limiting distribution for the depth in a PATRICIA trie, even...
Trie-Based Data Structures for Sequence Assembly
We investigate the application of trie-based data structures, suffix trees and suffix arrays, to the problem of overlap detection in fragment assembly. Both data structures are analyzed theoretically and experimentally for speed and space. By using heuristics, we can greatly reduce the calls to the time-consuming dynamic programming, and have improved the speed of overlap detection up to 1,000 times ...
Correlation Clustering for Crosslingual Link Detection
The crosslingual link detection problem calls for identifying news articles in multiple languages that report on the same news event. This paper presents a novel approach based on constrained clustering. We discuss a general way for constrained clustering using a recent, graph-based clustering framework called correlation clustering. We introduce a correlation clustering implementation that fea...
Compact Balanced Tries
[Summary by Mireille Régnier] Classical B-trees and prefix B-trees [1] offer both fast, direct addressing and easy sequential processing. They are balanced, segmented, and flexible. Flexibility means that a B-tree leaf splitting may be done at any position inside the leaf. This property is emphasised: one generates and suppresses empty leaves, while forcing the other leaves to a 100% storage utilisa...
Publication date: 2013